
implement file splitting functionality and enhance documentation#51

Merged
whhe merged 27 commits into oceanbase:main from Zhangg7723:powerrag_sdk_api
Feb 25, 2026
Conversation

@Zhangg7723 (Collaborator) commented:

Summary

  • Added split_file and split_file_upload methods to support file chunking via local paths, URLs, and uploads.
  • Updated README.md to include detailed examples for text and file splitting methods.
  • Enhanced error handling for unsupported parsers and invalid file inputs.
  • Introduced tests for file splitting functionalities to ensure reliability.

Solution Description

Zhangg7723 and others added 24 commits December 31, 2025 16:55
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
dosubot bot added the size:XL label (this PR changes 500-999 lines, ignoring generated files) on Jan 28, 2026

dosubot bot commented Jan 28, 2026

Documentation Updates

2 documents were updated by changes in this PR:

Markdown Processing and Chunking
@@ -1,12 +1,21 @@
-PowerRAG provides a robust system for processing markdown documents, supporting multiple chunking strategies to optimize retrieval-augmented generation (RAG) workflows. The system is designed to preserve document structure, handle complex markdown elements, and manage chunk sizes for downstream tasks such as embedding or LLM input.
+PowerRAG provides a robust system for processing documents, supporting multiple chunking strategies to optimize retrieval-augmented generation (RAG) workflows. The system is designed to preserve document structure, handle complex markdown elements, and manage chunk sizes for downstream tasks such as embedding or LLM input.
 
 ### Architecture and Chunking Strategies
 
-The core service for markdown chunking is `PowerRAGSplitService`, which exposes a unified interface for splitting text using different strategies, selected via the `parser_id` parameter. Supported strategies include:
+The core service is `PowerRAGSplitService`, which exposes two primary capabilities:
 
-- **Title-based chunking**: Splits content at markdown headers of a specified level, preserving section boundaries and returning both chunk content and associated titles.
-- **Regex-based chunking**: Splits text using a configurable regex pattern, then merges or further splits chunks based on token thresholds.
-- **Smart chunking**: Uses an AST-based approach to parse markdown structure, intelligently chunking by headings, containers (lists, tables), and token counts.
+1. **Text Splitting** (`split_text`): For markdown/text content using three specialized parsers
+2. **File Splitting** (`split_file`, `split_file_upload`): For files using all available ParserType methods
+
+#### Text Splitting
+
+The `split_text` method supports three specialized parsers for markdown and text content, selected via the `parser_id` parameter:
+
+- **Title-based chunking** (`title`): Splits content at markdown headers of a specified level, preserving section boundaries and returning both chunk content and associated titles.
+- **Regex-based chunking** (`regex`): Splits text using a configurable regex pattern, then merges or further splits chunks based on token thresholds.
+- **Smart chunking** (`smart`): Uses an AST-based approach to parse markdown structure, intelligently chunking by headings, containers (lists, tables), and token counts.
+
+**Note**: Only these three parsers are supported for `split_text`. For other parsers (such as `naive`, `book`, `qa`, `paper`, etc.), use the file splitting methods.
 
 Example usage:
 ```python
@@ -31,6 +40,71 @@
 
 Smart chunking parses the markdown document into an abstract syntax tree (AST) using `MarkdownIt`. It recursively processes AST nodes, treating headings as chunk boundaries and preserving containers such as lists, tables, and code blocks. Chunks are merged or split based on token counts and document structure. Large chunks are split first by headings, then by newlines, ensuring each chunk is close to the target token size and titles are preserved as prefixes.
 
+#### File Splitting
+
+The `split_file` and `split_file_upload` methods support all available ParserType methods, providing comprehensive file chunking capabilities. These methods work with local files, file URLs, and file uploads, supporting various document types including PDFs, Office documents, images, and HTML.
+
+**Supported ParserType Methods**:
+- **Basic parsers**: `naive`, `title`, `regex`, `smart`
+- **Specialized parsers**: `qa`, `book`, `laws`, `paper`, `manual`, `presentation`
+- **Format-specific parsers**: `table`, `resume`, `picture`, `one`, `email`
+
+The file splitting methods internally initialize a file chunker factory (`_init_file_chunker_factory`) that maps each ParserType to its corresponding chunking module from the `rag/app` and `powerrag/app` packages.
+
+**Usage Examples**:
+
+Using a local file path:
+```python
+service = PowerRAGSplitService()
+result = service.split_file(
+    filename="/path/to/document.pdf",
+    parser_id="book",
+    config={"chunk_token_num": 512, "delimiter": "\n。.;;!!??"}
+)
+```
+
+Using a file URL:
+```python
+result = service.split_file(
+    filename="https://example.com/doc.pdf",
+    binary=None,  # Binary will be downloaded
+    parser_id="naive",
+    config={
+        "chunk_token_num": 256,
+        "max_file_size": 128 * 1024 * 1024,  # 128MB
+        "download_timeout": 300,  # 5 minutes
+        "head_request_timeout": 30  # 30 seconds
+    }
+)
+```
+
+Using file upload (via API):
+```python
+# Read file binary
+with open("document.pdf", "rb") as f:
+    binary = f.read()
+
+result = service.split_file(
+    filename="document.pdf",
+    binary=binary,
+    parser_id="book",
+    config={"chunk_token_num": 512}
+)
+```
+
+**Configuration Parameters for File Splitting**:
+- `chunk_token_num`: Target chunk size in tokens (default: 512)
+- `delimiter`: Delimiters for splitting large chunks (default: `"\n。.;;!!??"`)
+- `lang`: Language for processing (default: `"Chinese"`)
+- `from_page`: Starting page number for PDF processing (default: 0)
+- `to_page`: Ending page number for PDF processing (default: 100000)
+- `max_file_size`: Maximum file size for URL downloads in bytes (file URL only)
+- `download_timeout`: Download timeout in seconds for file URLs (file URL only)
+- `head_request_timeout`: HEAD request timeout in seconds for file URLs (file URL only)
+
+The file splitting methods return chunks as a list of strings, along with metadata including the parser ID, total chunk count, and filename.
+[Source](https://github.com/oceanbase/powerrag/blob/a97000b728952b4bb42d01a2fc672b07bd0da6ec/powerrag/server/services/split_service.py#L41-L1386)
+
 ### Handling Markdown Elements
 
 Markdown elements, especially images, are carefully preserved during chunking. In smart chunking, image nodes are reconstructed using their `alt` and `src` attributes to produce the correct markdown syntax (`![alt](src)`). This ensures that image source links are not lost during chunking, addressing previous bugs where image sources were dropped in smart chunks [PR #11](https://github.com/oceanbase/powerrag/pull/11).
@@ -47,7 +121,9 @@
 
 ### Customizing and Extending Chunking Behavior
 
-Chunking behavior can be customized by selecting the appropriate `parser_id` (`title`, `regex`, `smart`) and configuring parameters such as:
+#### Text Splitting Configuration
+
+For text splitting with `split_text`, chunking behavior can be customized by selecting the appropriate `parser_id` (`title`, `regex`, `smart`) and configuring parameters such as:
 
 - `title_level`: Markdown header level for splitting (title-based chunking).
 - `chunk_token_num`: Target chunk size in tokens.
@@ -63,14 +139,37 @@
     config={"chunk_token_num": 256, "min_chunk_tokens": 64}
 )
 ```
+
+#### File Splitting Configuration
+
+For file splitting with `split_file` or `split_file_upload`, all ParserType methods are available. Configuration parameters vary by parser but commonly include:
+
+- `chunk_token_num`: Target chunk size in tokens (default: 512)
+- `delimiter`: Delimiters for splitting large chunks
+- `lang`: Language for processing
+- `from_page`, `to_page`: Page range for PDF processing
+- `max_file_size`, `download_timeout`, `head_request_timeout`: URL download settings
+
+Example using the `book` parser:
+```python
+result = service.split_file(
+    filename="/path/to/book.pdf",
+    parser_id="book",
+    config={"chunk_token_num": 512, "lang": "Chinese"}
+)
+```
 [Source](https://github.com/oceanbase/powerrag/blob/a97000b728952b4bb42d01a2fc672b07bd0da6ec/powerrag/server/services/split_service.py#L41-L1386)
+
+### Extending Chunking Logic
+
+To extend text chunking logic, implement a new chunker function and register it in the `CHUNKER_FACTORY` mapping within `PowerRAGSplitService`. Ensure your chunker accepts a configuration dictionary and returns chunks in the expected format. You may also customize AST node handling to support additional markdown elements or protected regions.
+
+To add support for new file parsers, implement a chunking module following the pattern of existing modules in `rag/app` or `powerrag/app`, and register it in the `_file_chunker_factory` mapping during initialization.
+
+### Binary File Parsing
 
 The system also supports parsing binary files (PDF, Office documents, images, HTML) into markdown, returning the markdown content, images (as base64), and metadata. The parsing configuration can specify layout recognition engines, formula and table recognition, and page ranges for PDFs [PR #40](https://github.com/oceanbase/powerrag/pull/40).
 
-### Extending Chunking Logic
-
-To extend chunking logic, implement a new chunker function and register it in the `CHUNKER_FACTORY` mapping within `PowerRAGSplitService`. Ensure your chunker accepts a configuration dictionary and returns chunks in the expected format. You may also customize AST node handling to support additional markdown elements or protected regions.
-
 ---
 
-For further details, refer to the [split_service.py implementation](https://github.com/oceanbase/powerrag/blob/a97000b728952b4bb42d01a2fc672b07bd0da6ec/powerrag/server/services/split_service.py#L41-L1386) and relevant [pull requests](https://github.com/oceanbase/powerrag/pull/11).
+For further details, refer to the [split_service.py implementation](https://github.com/oceanbase/powerrag/blob/a97000b728952b4bb42d01a2fc672b07bd0da6ec/powerrag/server/services/split_service.py#L41-L1386) and relevant pull requests: [PR #11](https://github.com/oceanbase/powerrag/pull/11), [PR #40](https://github.com/oceanbase/powerrag/pull/40), [PR #51](https://github.com/oceanbase/powerrag/pull/51).
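The factory-registration pattern described under "Extending Chunking Logic" above can be sketched in isolation. This is a minimal illustration only: the `CHUNKER_FACTORY` name appears in the documentation, but the registration decorator, the chunker signature `(text, config) -> list[str]`, and the config keys below are assumptions for the sketch, not the actual `PowerRAGSplitService` internals.

```python
import re

# Illustrative registry mapping parser_id -> chunker callable.
# The real service keeps a CHUNKER_FACTORY mapping; the signature
# (text, config) -> list[str] used here is an assumption.
CHUNKER_FACTORY = {}

def register_chunker(parser_id):
    """Register a chunker function under the given parser_id."""
    def deco(fn):
        CHUNKER_FACTORY[parser_id] = fn
        return fn
    return deco

@register_chunker("title")
def title_chunker(text, config):
    """Split markdown at headers of the configured level, keeping
    the header line with the chunk that follows it."""
    level = config.get("title_level", 1)
    # Lookahead split: a header of exactly `level` hashes starts a chunk.
    pattern = rf"(?m)^(?={'#' * level} )"
    return [c.strip() for c in re.split(pattern, text) if c.strip()]

def split_text(text, parser_id, config=None):
    """Dispatch to a registered chunker, mirroring the documented
    error behavior for unsupported parsers."""
    if parser_id not in CHUNKER_FACTORY:
        raise ValueError(f"Unsupported parser: {parser_id}")
    return CHUNKER_FACTORY[parser_id](text, config or {})
```

A new parser is then just one more `@register_chunker("...")` function; no dispatch code changes, which is the point of the factory mapping.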
PowerRAG SDK
@@ -9,6 +9,7 @@
 - Parse documents to Markdown format, including direct binary parsing for PDF, Office documents, images, and HTML ([source](https://github.com/oceanbase/powerrag/pull/40)).
 - Asynchronous and synchronous document parsing, with status polling and cancellation.
 - Manage document metadata, download content, and handle document chunks.
+- Split text and files into chunks using various parser methods, including support for local files, file URLs, and file uploads.
 
 ### Knowledge Base Management
 - Create, update, list, and delete chat sessions and agents.
@@ -104,6 +105,20 @@
 - `use_kg`: Use knowledge graph (bool)
 - `toc_enhance`: Table of contents enhancement (bool)
 
+### File Splitting
+- `parser_id`: Parser method ID (str)
+  - For text splitting (`split_text`): Only supports `title`, `regex`, `smart`
+  - For file splitting (`split_file`, `split_file_upload`): Supports all ParserType methods including `naive`, `title`, `regex`, `smart`, `qa`, `book`, `laws`, `paper`, `manual`, `presentation`, `table`, `resume`, `picture`, `one`, `email`
+- `chunk_token_num`: Target chunk size in tokens (int, default 512)
+- `delimiter`: Delimiter string (str, default `"\n。.;;!!??"`)
+- `lang`: Language (str, default `"Chinese"`)
+- `from_page`, `to_page`: Page range for PDFs (int, default 0 and 100000)
+- `file_path`: Local file path (str, optional, for `split_file`)
+- `file_url`: Remote file URL (str, optional, for `split_file`)
+- `max_file_size`: Maximum file size in bytes for URL downloads (int, optional, default 128MB)
+- `download_timeout`: Download timeout in seconds (int, optional, default 300)
+- `head_request_timeout`: HEAD request timeout in seconds (int, optional, default 30)
+
 ## Usage Examples
 
 ### Create a Dataset and Upload Documents
@@ -189,12 +204,82 @@
     print(chunk.content)
 ```
 
+### Split Text into Chunks
+```python
+# Text splitting only supports: title, regex, smart
+result = client.chunk.split_text(
+    text="# Chapter 1\n\nThis is the content of chapter 1.\n\n# Chapter 2\n\nThis is chapter 2.",
+    parser_id="title",  # Only: title, regex, or smart
+    config={"chunk_token_num": 512}
+)
+
+print(f"Total chunks: {result['total_chunks']}")
+for chunk in result['chunks']:
+    print(chunk)  # chunks are strings
+```
+
+### Split Files into Chunks
+```python
+# Method 1: Split file from local path (server must have access to the path)
+result = client.chunk.split_file(
+    file_path="/path/to/document.pdf",
+    parser_id="book",  # Supports all ParserType methods
+    config={
+        "chunk_token_num": 512,
+        "delimiter": "\n。.;;!!??",
+        "lang": "Chinese",
+        "from_page": 0,
+        "to_page": 100
+    }
+)
+
+# Method 2: Split file from URL
+result = client.chunk.split_file(
+    file_url="https://example.com/document.pdf",
+    parser_id="naive",
+    config={
+        "chunk_token_num": 256,
+        "max_file_size": 128 * 1024 * 1024,  # 128MB
+        "download_timeout": 300,
+        "head_request_timeout": 30
+    }
+)
+
+# Method 3: Upload file and split
+result = client.chunk.split_file_upload(
+    file_path="/path/to/local/document.pdf",
+    parser_id="book",
+    config={"chunk_token_num": 512}
+)
+
+print(f"Total chunks: {result['total_chunks']}")
+print(f"Filename: {result['filename']}")
+print(f"Parser used: {result['parser_id']}")
+for chunk in result['chunks']:
+    print(chunk)  # chunks are strings
+```
+
+**Supported ParserType methods for file splitting:**
+- Basic: `naive`, `title`, `regex`, `smart`
+- Professional: `qa`, `book`, `laws`, `paper`, `manual`, `presentation`
+- Special formats: `table`, `resume`, `picture`, `one`, `email`
+
+**Return value structure:**
+```python
+{
+    "parser_id": "book",
+    "chunks": ["chunk1", "chunk2", ...],  # List of strings
+    "total_chunks": 10,
+    "filename": "document.pdf"
+}
+```
+
 ## Integration Guidelines
 1. Install the SDK via pip.
 2. Import `PowerRAGClient` from `powerrag.sdk`.
 3. Initialize the client with your API key and server URL.
-4. Use resource objects (`dataset`, `document`, `chat`, `agent`) and their methods for all operations.
-5. Configure advanced options as needed for parsing, retrieval, and chat/agent creation.
+4. Use resource objects (`dataset`, `document`, `chat`, `agent`, `chunk`) and their methods for all operations.
+5. Configure advanced options as needed for parsing, retrieval, file splitting, and chat/agent creation.
 6. Handle exceptions as raised by SDK methods for error management.
 7. Refer to type annotations and docstrings for IDE assistance.
 
@@ -208,3 +293,69 @@
 
 ## License
 PowerRAG SDK is licensed under Apache-2.0 ([source](https://github.com/oceanbase/powerrag/pull/27)).
+
+## Frequently Asked Questions
+
+### What is the difference between text splitting and file splitting methods?
+
+The SDK provides three different methods for chunking content:
+
+**`split_text`**: Text-only splitting
+- Only supports three parser methods: `title`, `regex`, `smart`
+- Designed for plain text or Markdown content
+- No file handling required
+
+**`split_file`**: File splitting via path or URL
+- Supports all ParserType methods (15+ parsers)
+- Can process files from local paths (`file_path`) or remote URLs (`file_url`)
+- Server must have access to local paths when using `file_path`
+
+**`split_file_upload`**: Upload and split
+- Supports all ParserType methods (15+ parsers)
+- Uploads file from local system to server before splitting
+- Best for local files when server doesn't have direct access
+
+**When to use each method:**
+
+Use `split_text` when:
+- You have plain text or Markdown content
+- You only need `title`, `regex`, or `smart` parsers
+- You don't have a file to process
+
+Use `split_file` when:
+- You need parsers other than `title`, `regex`, or `smart` (e.g., `book`, `qa`, `naive`, `paper`)
+- The file is accessible via a URL
+- The file is on the server's filesystem (accessible via `file_path`)
+
+Use `split_file_upload` when:
+- You need parsers other than `title`, `regex`, or `smart`
+- The file is on your local machine
+- The server doesn't have direct access to the file path
+
+**Examples:**
+
+```python
+# Text splitting (only title, regex, smart)
+result = client.chunk.split_text(
+    text="# Chapter 1\n\nContent...",
+    parser_id="title"
+)
+
+# File splitting from local path
+result = client.chunk.split_file(
+    file_path="/server/path/doc.pdf",
+    parser_id="book"  # Can use any parser
+)
+
+# File splitting from URL
+result = client.chunk.split_file(
+    file_url="https://example.com/doc.pdf",
+    parser_id="naive"
+)
+
+# Upload and split
+result = client.chunk.split_file_upload(
+    file_path="/local/path/doc.pdf",
+    parser_id="qa"
+)
+```
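The decision rules in the FAQ above can be condensed into a small helper. The three method names mirror the SDK, but the helper itself (`choose_split_method`, its keyword arguments, and the `server_has_path` flag) is purely illustrative.

```python
TEXT_ONLY_PARSERS = {"title", "regex", "smart"}

def choose_split_method(parser_id, *, text=None, file_path=None,
                        file_url=None, server_has_path=False):
    """Pick the SDK method the FAQ recommends for a given input.

    Returns "split_text", "split_file", or "split_file_upload".
    The decision table mirrors the FAQ; the helper is a sketch.
    """
    if text is not None:
        if parser_id not in TEXT_ONLY_PARSERS:
            raise ValueError(
                f"split_text only supports {sorted(TEXT_ONLY_PARSERS)}")
        return "split_text"
    if file_url is not None:
        return "split_file"  # the server downloads the URL itself
    if file_path is not None:
        # split_file only works when the server can see the path;
        # otherwise upload the bytes with split_file_upload.
        return "split_file" if server_has_path else "split_file_upload"
    raise ValueError("Provide text, file_path, or file_url")
```

For example, a `book`-parsed file on the client's machine resolves to `split_file_upload`, while the same file on the server's filesystem resolves to `split_file`.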


dosubot bot added the documentation (Improvements or additions to documentation) and enhancement (New feature or request) labels on Jan 28, 2026
whhe requested a review from Copilot on February 5, 2026 at 13:03
Copilot AI (Contributor) left a comment

Pull request overview

This PR implements file splitting functionality for PowerRAG, enabling users to chunk documents via local paths, URLs, and uploads using various parser types. The changes enhance the existing text-only splitting capabilities by adding support for file-based parsing methods from the rag/app module.

Changes:

  • Added split_file and split_file_upload methods to support file chunking through local paths, URLs, and direct uploads
  • Enhanced error handling with more descriptive messages for unsupported parsers in text splitting
  • Updated documentation with comprehensive examples distinguishing between text and file splitting methods

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 18 comments.

Summary per file:

| File | Description |
| --- | --- |
| `powerrag/server/services/split_service.py` | Adds `_init_file_chunker_factory` and the `split_file` method to support file-based chunking with all parser types; improves error messages for unsupported text parsers |
| `powerrag/server/routes/powerrag_routes.py` | Implements `/split/file` and `/split/file/upload` endpoints; changes the ConnectionError status code from 503 to 400 |
| `powerrag/sdk/modules/chunk_manager.py` | Adds `split_file` and `split_file_upload` client methods with support for both file paths and URLs |
| `powerrag/sdk/tests/test_chunk.py` | Adds tests for file splitting upload functionality and unsupported-parser error handling; changes the test parser from `"naive"` to `"regex"` |
| `powerrag/sdk/tests/test_document.py` | Adds sleep delays for async operation timing in the `cancel_parse` test |
| `powerrag/sdk/README.md` | Comprehensive documentation updates clarifying `split_text` vs `split_file` usage, with examples for all three file splitting methods |
| `api/apps/sdk/powerrag_proxy.py` | Adds proxy endpoints for `split_file` operations; improves file handling with a `BytesIO` wrapper for async file reading |
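The `BytesIO` wrapper mentioned for `api/apps/sdk/powerrag_proxy.py` reflects a common pattern: an async handler drains the uploaded body into memory once, then hands downstream parsers a seekable file-like object. The sketch below is a generic illustration of that pattern under assumed names (`read_upload`, the fake body stream), not the proxy's actual code.

```python
import asyncio
import io

async def read_upload(reader):
    """Drain an async byte source and wrap it for sync consumers.

    An async request body can typically be read only once, while
    downstream parsers expect a seekable file-like object; BytesIO
    bridges the two. `reader` is any async iterator of byte chunks.
    """
    buf = io.BytesIO()
    async for chunk in reader:
        buf.write(chunk)
    buf.seek(0)  # rewind so consumers read from the start
    return buf

async def demo():
    async def fake_body():  # stand-in for a streamed request body
        for part in (b"%PDF-", b"1.7 ", b"payload"):
            yield part
    f = await read_upload(fake_body())
    return f.read()

print(asyncio.run(demo()))  # b'%PDF-1.7 payload'
```

Because the whole body is buffered in memory, this pairs naturally with an upload size limit such as the `max_file_size` setting documented above.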

dosubot bot added the size:XXL label and removed size:XL on Feb 11, 2026
dosubot bot added the size:XL label and removed size:XXL on Feb 11, 2026
whhe merged commit fd6e540 into oceanbase:main on Feb 25, 2026
3 checks passed